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Abstract: 

The hydrophobic/polar HP model on the square lattice has been widely used to inves- 
tigate basics of protein folding. In the cases where all designing sequences (sequences 
with unique ground states) were enumerated without restrictions on the number of 
contacts, the upper limit on the chain length N has been 18-20 because of the rapid 
exponential growth of the numbers of conformations and sequences. We show how a 
few optimizations push this limit by about 5 units. Based on these calculations, we 
study the statistical distribution of hydrophobicity along designing sequences. We 
find that the average number of hydrophobic and polar clumps along the chains is 
larger for designing sequences than for random ones, which is in agreement with ear- 
lier findings for N < 18 and with results for real enzymes. We also show that this 
deviation from randomness disappears if the calculations are restricted to maximally 
compact structures. 
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1 Introduction 



Coarse-grained models are an important tool in theoretical studies of protein folding, 
for computational as well as conceptual reasons, and have been used to gain insights 
into the physical principles of folding (for a recent review, see Ref. [0). These models 
are often lattice based. The main advantage of using a discrete conformational space 
is that exact calculations can be performed for short chains, by exhaustive enumera- 
tion of all possible conformations. One model that has been extensively studied this 
way is the minimalistic hydrophobic/polar HP model of Lau and Dill |2j| on the square 
lattice. In previous work on this model, all sequences with unique ground states were 
determined for chains with up to iV = 18-20 monomers 0-0. Such sequences are 
called designing; they design their ground state conformations. 

In this paper, we show how a few optimizations make it possible to extend these 
calculations to iV = 25, which corresponds to a 4000-fold increase in the number 
of possible sequence, conformation pairs.Q We then use this data set to address the 
question of how designing sequences differ from random ones statistically. 

By analyzing the behavior of block variables, it has been found that designing N < 18 
HP sequences as well as real (globular) protein sequences |], |7| show negative 
hydrophobicity correlations. Therefore, one expects to find an increased number of 
hydrophobic and polar clumps along these chains. In this paper, we show that the 
average number of clumps is indeed larger for designing HP sequences than for random 
sequences. In particular, this implies that the earlier finding that designing sequences 
show negative hydrophobicity correlations remains unaffected when increasing N to 
25. This provides a non-trivial test of the robustness of this property. 

In lattice model studies it is not uncommon to consider only maximally compact 
conformations, which for N = 25 are confined to a 5 x 5 square. This drastic reduction 
of conformational degrees of freedom leads to a sharp rise in the number of designing 
sequences. An interesting question is whether this reduction also affects the statistical 
properties of designing sequences. To address this question, we repeat the same 
statistical analysis for sequences that are designing when only maximally compact 
conformations are used. The results turn out to be qualitatively different in this 
case. In particular, this means that the agreement with the results obtained for real 
sequences gets lost when this reduced conformational space is used. 
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The complete list of all designi ng N < 25 sequences and the corresponding conform ations will 



be made electronically available at tittp : //www . thep . lu . se/complex/wwwserver . html 
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Finally, we also study the character of the folding transition for one of the designing 
iV = 25 sequences, which was selected by an optimization procedure. The thermo- 
dynamic behavior of this sequence is studied using Monte Carlo simulations. 

The paper is organized as follows. In Section |2| we define the model and describe 
the algorithm and optimizations used for finding all designing sequences with N < 
25. Our results are discussed in Section |||, which contains sequence and structure 
statistics, the statistical analysis of designing sequences, and the thermodynamic 
study of an optimized sequence. A summary is given in Section [|. 



2 Enumerating designing sequences 



In lattice models of proteins it is common to use a contact potential. This means 
that the energy that a sequence gets with a certain conformation, is given by what 
contacts exist in that conformation. That is, 

E = J £C ij U(a i ,<r j ) (1) 

where the contact map is defined as 

„ f 1 if monomers i, j are neighbors on the lattice but \i — j\ ^ 1 /ri , 
J 10 otherwise 

and U(ai,<jj) is the interaction matrix. In the HP model there are two amino acids, 
hydrophobic (H) and polar (P). The interaction matrix is 



f=r (3> 




with H in the first row and column, so the energy is determined exclusively by the 
number of HH contacts. 

Two conformations that have identical contact sets will have the same energy for 
all sequences, so they can be represented by a single contact set which is marked 
as degenerate. What this means is that a large number of conformations can be 
reduced to a smaller number of contact sets. A contact set that corresponds to a 
single conformation will be called unique. 
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Table 1: Contact set terminology. 



Unique Corresponds to a single self-avoiding walk 

Degenerate Corresponds to more than one self-avoiding walk 

Maximal Not a subset of any other realizable contact set 

Redundant Can be ignored in the search for designing sequences 



2.1 Compressing conformational space 

To find all designing sequences, we first determine all relevant conformations, which 
then are combined with sequences in Section |2.2| . The conformation search is some- 
times simplified by only considering maximally compact conformations, where the 
number of contacts is maximal. Looking only at those conformations corresponds 
to shifting the energies by adding a large negative term to all elements of the in- 
teraction matrix. An efficient method for enumerating compact conformations was 
recently proposed by Kloczkowski and Jernigan ||. 

In this paper, we consider all possible self-avoiding walks, and not only those that are 
maximally compact. The space of all possible self-avoiding walks grows rapidly with 
N. Using contact sets rather than self-avoiding walks gives a significant reduction 
of the conformational space (see Table 2 below), but this reduction is not sufficient 
for our purposes; memory limitations prevented us from storing the complete list 
of all possible contact sets for N > 24. To be able to go to larger N, we therefore 
developed two procedures for reducing the number of contact sets to be stored. These 
procedures are based on the observation that as long as there are no repulsive forces 
(that is, as long as the elements of the interaction matrix are all non-positive), it is 
never energetically disadvantageous to add contacts as long as no existing contacts 
are broken. 

Before discussing these two procedures, it is helpful to introduce some terminology. A 
contact set will be called maximal if it is not a proper subset of any other (realizable) 
contact set. A conformation that is designed by at least one sequence is designable. It 
can be readily seen that every designable self-avoiding walk corresponds to a unique 
and maximal contact set. Another important class are contact sets that can be 
safely ignored in the search for designing sequences. Such contact sets will be called 
redundant. A summary of our contact set terminology can be found in Table 1. 
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2.1.1 Eliminating contact sets: Step 1 

As mentioned above, it was impossible for us to store the complete list of all contact 
sets for iV > 24. To circumvent this problem, we used a program that for each self- 
avoiding walk tries a carefully selected, predefined set of (local) moves. If any of these 
moves can be performed without destroying any existing contact (new contacts may 
form), the self-avoiding walk is discarded. All possible self-avoiding walks surviving 
this test are converted to contact sets. It is important to stress that the moves are 
chosen so that the resulting reduced list of contact sets has the property that any 
realizable contact set is a subset of some set in this list. In particular, the move set 
does not contain the inverse of any of its elements. 

The list of contact sets obtained this way was indeed small enough to be stored 
up to N = 25, but the procedure has the disadvantage that it may eliminate non- 
redundant contact sets. This is an unwanted property, but the problem is easy to 
solve. The solution is that once the sequences that have non-degenerate ground states 
with respect to the non- discarded conformations have been found, each sequence is 
combined with its conformation, and the chain is tested with the opposite of the 
tests used to discard conformations. That is, the program tests whether any of the 
opposite moves can be performed without breaking any of the existing HH contacts. 
By using the fact that the forces are repulsive, it can be seen that if no such move 
is possible, then the conformation has to be a unique ground state of this sequence. 
Hence, by performing this test, one can make sure that no sequence is falsely declared 
to be designing, even though there are non-redundant contact sets that are missing 
in the list used. 

It is also important to note that all the self-avoiding walks with a given contact set 
are never discarded if the set is maximal. This means that all the contact sets that 
correspond to designable conformations are included in the list generated by this 
procedure. This is important because it implies that this reduced list of contact sets 
can be used without losing any designing sequence. What could go wrong when using 
this list is that some non-designing sequences were classified as designing, but this is 
avoided by performing the test discussed above. 



2.1.2 Eliminating contact sets: Step 2 

Our second procedure removes contact sets from the list produced by the first pro- 
cedure. This is done to speed up the calculations. All contact sets that are removed 
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in this second step are redundant. 

The procedure relies on the fact that all pair energies are non-positive. To see how 
that can be used, consider one set of contacts A which is a subset of another contact 
set B. It is then clear that A cannot represent a unique ground state, because for any 
sequence, B has the same or lower energy. The set A is nevertheless non- redundant 
in case B is a unique contact set that would be falsely classified as the unique ground 
state of some sequence if A were removed. If, on the other hand, the set B is 
degenerate, then it follows that A must be redundant. 

Suppose now that A is a subset of two other sets B and C . Then, provided that C 
is kept, there can be no sequence such that A is needed in order to decide whether 
or not this sequence has B as a unique ground state. This follows immediately from 
the fact that for any given sequence, both B and C have energies at least as low as 
that of A. Hence, if a contact set is a subset of two or more sets, then it has to be 
redundant. In particular, this is true if the set is a subset of a subset. 

This reasoning gives us the following simple procedure for elimination of redundant 
contact sets. 

• For each set A, find all sets of which A is a subset. 

• Keep A if 

1. no such sets are found, or 

2. one set is found, and this set corresponds to one conformation. 

• Otherwise discard A. 

Note that those of the surviving contact sets that meet condition 1 are maximal. 

It should be stressed that, because all the contact sets removed by this procedure are 
redundant, one can still use the test in Section |2.1.1| to make sure that no sequence 
is falsely classified as designing. 

Implementing the rules described above requires some care, since the number of sets 
grows rapidly with N . The solution we used is based on storing the sets in a tree, 
where each node in the tree represents the question of whether or not a certain contact 
is included. Before building the tree, it is useful to sort the sets in descending order 
by the number of contacts they contain. This sorting ensures that a set, once added 
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to the tree, never has to be removed, since all sets of which it can be a subset are 
already added when it is examined. Having access to the tree, it is straightforward 
to search for supersets of a given set. One starts at the root of the tree and each 
time the tree branches one has to consider either one of the branches or both. The 
procedure is memory-consuming for large N (see example below), but the CPU time 
required is relatively modest. 



2.1.3 Examples 

To get an idea of the sizes of these different lists of contact sets, let us consider the case 
N = 18. For this N, there are 5808335 self-avoiding walks and 170670 contact sets. 
After applying the redundancy test in Section |2.1.2| to the list of all possible contact 
sets, we are left with 33223 maximal contact sets (condition 1) and 6609 contacts sets 
with one superset corresponding to one conformation (condition 2). Among the 33223 
maximal contact sets, there are 6181 sets representing more than one conformation. 
Subtracting the degenerate contact sets, we are left with 33223—6181 = 27042 contact 
sets corresponding to possibly designable conformations. Of those 27042 sets, 5660 
sets have a total of 6609 listed subsets. These subsets are needed because they may 
degenerate an otherwise unique ground state corresponding to one of the 5660 sets. 

Here, the redundancy test was applied to the complete list of all possible contact 
sets. Alternatively, we may first use the program described in Section [2.1.1| , which 
for iV = 18 generates a list of 51373 contact sets (which contains the 33223 maximal 
ones). When applying the redundancy test in Section |2.1.2| to this reduced list, we 
end up with 33223 + 449 contact sets. Note that with this approach, we find only 
449 of the 6609 non- redundant contact sets meeting condition 2. As a result, some 
conformations are erroneously found to be unique ground states. The test discussed 
in Section [2.1.1| removes these false unique ground states. 

The CPU times required to generate these different lists on a Pentium III 800 MHz 
were as follows. Generating the list of all possible contact sets, by exhaustive enu- 
meration, took 6 seconds, and reducing this list from 170670 to 33223 + 6609 contact 
sets required 7 seconds. The time needed to run the program that generates the 
list described in Section |2.1.1| was 3 seconds, and reducing this list from 51373 to 
33223 + 449 contact sets took 1 second. The corresponding two numbers for iV = 25 
were 40 and 60 minutes, respectively. In this case, building the tree used in the 
redundancy test required 220 megabytes of internal memory. 
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2.2 Searching sequence space 

We now turn our attention to the sequence space. For each of the 2 N sequences we 
wish to determine what set of contacts gives the lowest energy. If this ground state 
energy can be accomplished only by a single contact set, and if that set corresponds 
to a single conformation, the sequence designs that conformation. 



2.2.1 Organizing the search 

The most straightforward approach to finding all sequences with unique ground states 
is to go through all the sets of contacts for each sequence, and calculate the energy for 
each of the sets. By only storing the differences between consecutive sets of contacts, 
and by representing the sequences and contacts as numbers with one bit per position, 
the number of operations required for each combination can be kept small. 

It is, however, also possible to use a very different approach. Represent each sequence 
by a binary number, and consider any given set of contacts. Between two consecutive 
sequences two bits are toggled on the average, which indicates that using information 
about the previous sequence and its energy will be a lot faster than recalculating 
the energy from scratch. This approach can be used if the whole sequence space is 
examined for one contact set at a time. The downside to doing this is that information 
needs to be stored for every sequence until all the contact sets have been considered. 

Neither of the methods described above is bad, but they each contain an optimization 
that the other does not have. It is desirable to utilize both the similarity of consecutive 
sequences and the similarity of consecutive contact sets. The solution is to divide the 
sequence space into a number of blocks of fixed size, and apply the second method to 
each of those blocks. A block consists of 2 M sequences that have their first N — M 
residues in common. This part of the sequence will be referred to as the fixed part, and 
the remaining M positions make up the variable part. We call a contact active if it 
connects two H monomers, and note that for each contact there are three possibilities, 
depending on the position and type of the monomers it connects: 

• If both monomers are in the fixed part, the contact gives a contribution of — 1 
to the energy for all the sequences of this block, if and only if both monomers 
are H. 

• If at least one of the monomers is a P in the fixed part, the contact cannot be 
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active for any sequence in the block. 



• If both monomers are in the variable part, or if one of them is in the variable 
part and the other is an H in the fixed part, whether or not the contact is active 
depends on the variable part. 



2.2.2 The cutoff energy 

This leads us to the next optimization, which has to do with the possible ground state 
energies, and can be seen as a sequence-dependent reduction of the conformational 
space. Clearly, a non-degenerate ground state with N > 3 cannot have an energy 
of 0. More generally, it is unreasonable that a small number of active contacts should 
be enough to give an arbitrarily long polymer a unique ground state. To see how this 
can be used to speed up the calculations, consider some contact set and sequence 
block, such that there are p active contacts in the fixed part and q contacts whose 
state depends on the variable part. If — (p + q) is larger than some cutoff value E max , 
none of the sequences in this block can have a unique ground state for this contact set. 
The major problem that arises with this optimization is to know what value to use 
for E max ; the algorithm will find only those unique ground states that have energies 
E < E max . For N < 20 all energies have been considered in the calculations, and it 
turns out that there are no unique ground states with E > — 4 for 15 < N < 20 (see 
Table 3 below). For N > 20 we have not proven that there can be no unique ground 
states with E > —4, but it seems very reasonable that if there are any, most or all 
of them would have E = —3. Therefore we have used a cutoff energy E max = —3 for 
iV > 20. It turns out that for 20 < N < 25 there are no unique ground states with 
energy —3, and this strongly indicates that —4 is the highest possible ground state 
energy for any HP chain with N > 15. 

To illustrate how the computer time required varies with E max , let us again take 
N = 18 as an example. For this N, the sequence search is found to be about four 
times faster with E max = —4 than with E max = — 1. With E max = —4, the total time 
needed to find the 6349 designing N = 18 sequences, including the time required to 
generate the conformations, is a few minutes on a Pentium III 800 MHz. 
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3 Results 



Using the algorithm and optimizations described above, it was possible to determine 
all designing sequences for N < 25 within a reasonable amount of time. Previous 
work has covered iV < 18 for the normal HP model || and iV < 20 for shifted 
HP models j|, [|]. The increase in N corresponds roughly to a 100-fold increase 
in the number of known designing sequences and conformations. This gives better 
confidence when doing statistics on the designing sequences, and it makes it possible 
to study how properties of the model depend on protein size. 



3.1 Sequence and structure statistics 



Sequence and structure statistics for 4 < N < 25 are summarized in Tables 2 and 
3. Column three of Table 2 shows the total number of contact sets, which has been 
studied before ||. It was estimated to grow as fi N with \i = 2.29 ± 0.02 for large 
iV |J. The number of maximal contact sets, column four of Table 2, appears to grow 
exponentially too, but slightly slower. A fit of our data for 15 < iV < 25 to the 
form fi N yields fi ~ 2.07. This growth with N is considerably slower than that of 
the number of self-avoiding walks, for which the best available estimate is ~ iV 7-1 /!^ 



with 7 = 43/32 and /i = 2.6381585 |0|. 



The fifth column of Table 2 shows the number of designing sequences. It turns out 
that the fraction of designing sequences varies between 2.27 and 2.57% for 19 < 



iV < 25, which is in line with previous results for smaller N [IT|. The last column 
of Table 2 shows the number of designable conformations. The designability of a 
conformation is the number of sequences that designs it. From Table 2 it can be seen 
that the average designability of the conformations that are designable grows with 
N and is 765147/107336 « 7.1 for iV = 25. 

In Table 3 we show the distribution of ground state energies for different N, which 
is crucial for the optimization discussed in Section |2.2.2j . 



Figure 1 shows three designable N = 25 conformations. The conformation (a) is the 
most designable one for N = 25, with a designability of 326, whereas conformation 
(c) is designed by one sequence only. 

For comparison, we also determined the sequences that are designing when only 
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Table 2: The number of designing sequences Sn and designable conformations Djy for 
the HP model on a square lattice. The third column shows the total number of contact 
sets, as obtained by exhaustive enumeration of all possible self- avoiding walks. Memory 
requirements (see Sec |2.1.1| ) prevented us from counting them for N > 24. Column four 

sho ws the number of maximal contact sets. 

Maximal 



N 


Conformations 


Contact sets 


contact sets 


Sn 


D N 


4 


5 


2 


1 


4 


1 


5 


13 


3 


2 








6 


36 


8 


4 


7 


3 


7 


98 


14 


9 


10 


2 


8 


272 


41 


20 


7 


5 


9 


740 


78 


39 


6 


4 


10 


2034 


212 


95 


6 


4 


11 


5513 


424 


174 


62 


14 


12 


15037 


1113 


420 


87 


25 


13 


40617 


2309 


779 


173 


52 


14 


110188 


5953 


1818 


386 


130 


15 


296806 


12495 


3409 


857 


218 


16 


802075 


31940 


7810 


1539 


456 


17 


2155667 


67389 


14717 


3404 


787 


18 


5808335 


170670 


33223 


6349 


1475 


19 


15582342 


363010 


63434 


13454 


2726 


20 


41889578 


910972 


140939 


24900 


5310 


21 


112212146 


1953847 


273049 


52183 


9156 


22 


301100754 


4868343 


599821 


97478 


17881 


23 


805570061 


10513774 


1174460 


199290 


31466 


24 


2158326727 




2561884 


380382 


61086 


25 


5768299665 




5057733 


765147 


107336 



maximally compact conformations are used. In this case, it turns out that there are 
6181800 designing N = 25 sequences. The corresponding number is 765147 when 
the full conformational space is used (see Table 2), so the ratio between the number 
of designing sequences in the maximally compact ensemble and the number of truly 
designing sequences is 6181800/765147 ~ 8.1, for N = 25. This ratio has previously 



been shown [I2| to fluctuate between approximately 4 and 11 for iV = 11 through 
AT =18. 

It is worth noting that among the 765147 iV = 25 sequences that are designing 
when the full conformational space is used, there are only 605 sequences that design 
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Table 3: The number of designing sequences for different N and ground state energies. 



N 


Energy 
































-1 


-2 


-3 


-4 


-5 


-6 


-7 


-8 


-9 


-10 


-11 


-12 


-13 


-14 


-15 


-16 


4 
5 
6 


4 



7 






























7 

8 






10 



7 




























9 








8 




























10 








1 


5 


























11 








6 


51 


2 
























12 








2 


27 


49 


9 






















13 











78 


54 


41 






















14 








2 


53 


110 


132 


89 




















15 











58 


88 


355 


330 


26 


















16 











43 


158 


250 


638 


417 


33 
















17 











.33 


160 


662 


882 


1337 


330 
















18 











21 


149 


623 


1431 


2021 


1676 


425 














19 











8 


154 


971 


1936 


4996 


3324 


2007 


58 












20 











13 


117 


955 


2573 


5582 


7665 


5415 


2481 


69 










21 









17 


134 


1312 


3116 


11670 


12132 


13917 


8898 


987 










22 









12 


120 


1116 


3802 


11672 


22386 


24171 


22394 


10610 


1195 








2.3 









26 


92 


1547 


4204 


21050 


29944 


56902 


44940 


31961 


8118 


506 






24 









17 


134 


1321 


4916 


21096 


48017 


78496 


100746 


75346 


40376 


9596 


321 




25 









20 


64 


1708 


5270 


32484 


59470 


158044 


159704 


191377 


102944 


46386 


7688 


6 




Figure 1: Some designable structures with TV = 25: (a) the most designable structure, 
(b) a maximally compact structure, and (c) a structure with few contacts. Filled circles 
represent H. 

maximally compact conformations. Furthermore, it turns out that no maximally 
compact conformation is designed by more than 10 sequences, whereas the most 
designable conformation, as mentioned above, has a designability of 326. In fact, 
there are 19360 conformations that are more designable than the most designable 
one among the maximally compact conformations. 

It is interesting to compare these results to those of Shahrezaei and Ejtehadi ||, 
who studied a shifted HP model where the interaction matrix is given by U (H, H) = 
-2 - 7 - E c , U(R, P) = -1 - E c and U(P, P) = -E c . Using the full conformational 
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ensemble, these authors found that the set of highly designable conformations was 
independent of the parameters 7 and E c . In particular, this suggests that the set 
of highly designable conformations should remain the same when only maximally 
compact conformations are considered. Our results show that this conclusion does 
not hold in the original, unshifted HP model. 



3.2 Statistical properties of designing sequences 



In this section, we study the statistical properties of designing sequences by moni- 
toring two different quantities. The first one is the total hydrophobicity M, defined 

as 

JV 1 1 „ 

1 + <x 



M = (4) 



i=i 



where a l = 1 (H) or <j,i = — 1 (P). Our second quantity is the number j of hydrophobic 



and polar clumps along the chain |T3| , which can be written as 



. N+l l^ 1 
z z i=i 

A similar analysis has previously been performed for N < 18 ||. By analyzing the 
fluctuations of block variables, evidence was found for negative <Ji,<Jj correlations. 
Therefore, we expect the average j to be larger for designing sequences than for 
random ones. 



In Figure 2a we show the relative abundance of hydrophobic amino acids, (M)/N, 
as a function of N, where (•) denotes an average for fixed N. For the sequences 
that are designing when the full conformational space is used, we see that the N 
dependence of (M)/N is quite weak if N is not too small, which confirms a trend 
seen earlier 0. Furthermore, we see that these sequences, as expected, are more 
hydrophobic than those that are designing when only maximally compact conforma- 
tions are used. Finally, it can also be seen that the sequences that design maximally 
compact conformations differ greatly from those that are designing when only such 
conformations are used. Figure 2b shows the frequency of different M for N = 25. 



In Figure 3 we show the results of our clump analysis for N = 25. The average number 
of clumps for fixed M (and N), (j)m, is indeed found to be larger for designing 
sequences than for random ones, which is in nice qualitative agreement with previous 
results for real protein sequences and model sequences with smaller iV ||[7j . Sequences 
that are designing when the interaction energies are shifted so far that only the 
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5 10 15 20 25 5 10 15 20 25 



N M 

Figure 2: (a) The average hydrophobicity (M) /N as a function of N and (b) the frequency 
/a/ of different M for N = 25. Shown are the results for all designing sequences (+), all 
sequences that are designing when only maximally compact conformations are considered 
(x), and sequences that design maximally compact conformations when all conformations 
are considered (□). The dashed lines represent random sequences. 




6 - / x 
4 - / 

2 -f V 

I ' 1 1 1 

5 10 15 20 25 

M 

Figure 3: The number of clumps, {j)mi as a function of the total hydrophobicity M for 
N = 25. Shown are the results for sequences that are designing when all conformations 
(+) and only maximally compact ones (x), respectively, are considered. The dashed line 
represents random sequences. 

1081 maximally compact conformations need to be considered, have, by contrast, a 
0)m close to that of random sequences. Hence, the results obtained from studying 
only maximally compact conformations seem to be of less relevance, with respect 
to the applicability to real proteins, than the results obtained when considering all 
conformations. 

Finally, we note that Buchler and Goldstein have compared various lattice models 
and found arguments against the use of only two letters [fl~4l,|i~5fl, as in the HP model. 
These findings were all based on calculations for maximally compact conformations. 
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However, it is not known to what extent the results of such calculations remain 
valid when the full set of conformations is used. Our results show that, in the HP 
model, both the set of highly designable conformations and the statistical properties 
of designing sequences depend strongly on which of the two conformational ensembles 
is used. 



3.3 The character of the folding transition 



The model studied in this paper has the important feature that there exists a signif- 
icant number of designing sequences (this is not true on the triangular [IS] and cu- 



bic [[H]] lattices), and that the corresponding conformations tend to show protein-like 
regularities |18[]. However, as in other two-dimensional models, the folding transition 
is not protein-like in character for the typical sequence; the folding process is not 
cooperative enough |19|] . On the other hand, the folding behavior is, at least to some 
extent, sequence dependent, and therefore we decided to look into the thermodynamic 
behavior of a carefully chosen sequence. 

This sequence was obtained by applying a Monte Carlo-based sequence design algo- 



rithm []5D| to the 326 sequences that design the most designable N = 25 conformation 
(see Figure la). The design algorithm maximizes the stability of a given conforma- 
tion with respect to sequence at a fixed non-zero temperature. The sequence we 
obtained by using this method is shown in Figure la. Subsequently, this sequence 
was subjected to Monte Carlo simulations at different temperatures. In Figure 4 we 
show the temperature-dependence of the specific heat, which is found to exhibit a 
pronounced peak. Also shown is the distribution of energy at kT = 0.479, which is 
just above the specific-heat maximum. The energy distribution has one peak corre- 
sponding to the ground state, at E = —13, and another, broader peak centered at 
E pa —8. The coexistence of these states implies that the folding transition is much 
more cooperative for this sequence than for the typical sequence in this model. 



4 Summary 



By greatly reducing the conformational space, and by carefully optimizing the se- 
quence space exploration, we were able to decrease the time needed to exhaustively 
search for designing sequences with N = 18 roughly thousandfold compared to the 
most naive methods. It became feasible to find all designing sequences for iV as large 
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Figure 4: Results from Monte Carlo simulations of the sequence shown in Fig. la. (a) 
Temperature dependence of the specific heat C v = ((E 2 ) — (E) 2 )/kT 2 . The line is an 
extrapolation obtained by umbrella sampling [^TJ. (b) Histogram of energy at kT = 0.479. 

as 25 using only a small number of workstations. The results obtained by doing this 
were used to look at the statistical properties of designing sequences. We found that 
the average number of hydrophobic and polar clumps along the chains is larger for 
designing sequences than for random ones. In particular, this means that the find- 
ing that designing HP sequences, like real enzymes, show negative hydrophobicity 
correlations |||7] remains unaffected when increasing N from 18 to 25. By contrast, 
qualitatively different results were obtained when discarding conformations that are 
not maximally compact. This is of interest because restrictions to compact structures 
are common in both lattice model studies and determinations of statistical potentials 
from known protein structures. Finally, we saw an example of a folding behavior that 
is more cooperative than for the typical sequence in this model. 
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